TDT 2001 Topic Tracking at RMIT University
Abstract
This is RMIT's first year of participation in the TDT evaluation. Our system uses a linear classifier to track topics and an approach based on our previous work in document routing. We aimed this year to develop a baseline system, and then to test selected variations, including adaptive tracking. Our contribution this year was to implement an efficient system, that is, one that maximises tracking document throughput and minimises system overheads.

1. Topic Tracking Objectives

For this year's evaluation we participated only in the topic tracking task. Given a few example stories on a particular topic, the task of a tracking system is to process a supplied list of data files, and classify all stories in these files as either on-topic or off-topic. This task is repeated for over 100 topics.

There are few examples provided for tracking a topic. Under the TDT evaluation rules, only one, two, or four on-topic examples are provided, and only zero or two off-topic examples. To simulate field conditions for a tracking system, no data may be used that is as recent as, or more recent than, the data supplied. Therefore, any term statistics or pseudo-relevance feedback used must be gathered from sources published before 1998.

1.1. System Objectives

For this year's evaluation, we aimed to implement a baseline system that would produce reasonable, TDT-compliant results. We restricted our problem domain to English language documents and automatically-translated Mandarin; we did not consider any ASR texts, and only processed newswire and manually-transcribed broadcast news stories. Our secondary goals were efficiency (a similar goal to our recent categorisation research [7]) and to experiment with various term extraction and profiling techniques, to see which work better and why.

2. Implementation Details

The system we used was based on a Rocchio [6] routing categoriser we previously developed for fast document categorisation [7]. The system was developed in C, and experiments were carried out on a Pentium III based server with 256 megabytes of main memory. In this environment, our system can process around 10 gigabytes of SGML-formatted text per hour.

Terms were extracted as single words, delimited by whitespace. All terms extracted were case-folded. Words were then stemmed using Porter's algorithm [4]. We found that the Porter stemmer is a significant overhead: the throughput of our system was halved when the stemmer was employed. We plan in the future to use a simpler or dictionary-based stemmer for efficiency. A list of 477 common English words was used as a stoplist.

As a background model of news document text, we used newswire documents from Associated Press, Wall Street Journal, and the Federal Register from TREC disks 1 and 2 [1]. Term document frequency statistics were gathered and stored in an auxiliary vocabulary file. Along with the document frequency (DF), we store the Okapi [5] inverse document frequency of these terms:

$$IDF = \ln \frac{N - df + 0.5}{df + 0.5}$$

In cases where df > N/2, which results in a negative IDF, we set IDF to zero. A document term frequency (TF) factor is calculated using an Okapi term weight:

$$TF = \frac{tf}{0.5 + 1.5 \, \frac{dl}{dl_{avg}} + tf}$$

where dl is the length of the document, dl_avg is the mean document length in the auxiliary corpus, and tf is the number of times the term occurs in the current document. This factor is multiplied by the IDF value for that term to produce a weight w:

$$w = TF \cdot IDF$$
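As a concrete illustration, the weighting scheme can be sketched as follows. This is a minimal Python sketch of the formulas above, with hypothetical names; the actual RMIT implementation was written in C.

```python
import math

def okapi_weight(tf, df, N, dl, dl_avg):
    """Okapi-style weight w = TF * IDF for one term in one document.

    tf:     occurrences of the term in the current document
    df:     background-corpus documents containing the term
    N:      number of documents in the background corpus
    dl:     document length
    dl_avg: mean document length in the auxiliary corpus
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    if idf < 0.0:
        idf = 0.0  # negative IDF (df > N/2) is clamped to zero
    tf_factor = tf / (0.5 + 1.5 * (dl / dl_avg) + tf)
    return tf_factor * idf
```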
The weight of each term in the document is stored along with the index of that term in the dictionary to form a vector of weights. New terms observed while processing the TDT data files are added to the term dictionary with DF = 1.

Topic profiles are maintained for each topic. The topic profile is calculated initially by generating weight vectors for each of the training examples supplied, then adding the positive examples and subtracting the negative examples to form one vector. Following the well-known heuristic [2, 3], negative weights in the profile are zeroed; the Rocchio parameters $\beta$ and $\gamma$ are therefore both set to 1.0. The resultant vector is then normalised to unit size.

The similarity sim between the topic profile vector, u, and a document vector, v, is calculated using the well-known cosine measure [8]:

$$sim = \cos(\theta) = \frac{u \cdot v}{|u| \, |v|}$$

The threshold for classification decisions is set by calculating the cosine measure of each of the training examples against the topic profile. The maximum off-topic and minimum on-topic values are recorded and a threshold calculated according to the following formula:

$$thresh = \begin{cases} on_{min} & \text{if } off_{max} > on_{min} \\ off_{max} + \dfrac{on_{min} - off_{max}}{TWB} & \text{if } off_{max} < on_{min} \end{cases}$$

where TWB is a threshold weight bias, which defaults to 4, but can be set higher in order to lower the threshold and give better recall. A similarity higher than the threshold results in a YES decision for that document, while a similarity lower than the threshold results in a NO.

2.1. Adaptive Profiling

Simple, adaptive profiling was built into the system as an optional feature. With adaptive profiling enabled, the system performs unsupervised learning. Documents that are selected as on-topic (those with a YES decision) are added to the topic profile. Documents that are off-topic (those with a NO decision) are subtracted from the topic profile. A fixed number of the highest-weighted terms are retained in an active profile, which is used for performing similarity calculations. In our initial implementation described here, translated Mandarin documents were not used in adaptive feedback, as it was found that the limited vocabulary in these documents skews the profile towards recognising all Mandarin documents as on-topic.

Document vectors that are used to adapt the profile are scaled down to reduce the possibility of a false decision distorting the profile. A certainty factor is calculated for each decision made as follows:

$$cert = \begin{cases} \dfrac{sim - thresh}{pos_{max} - thresh} & \text{if } sim > thresh \\ \dfrac{thresh - sim}{thresh - neg_{min}} & \text{if } sim < thresh \end{cases}$$

where pos_max and neg_min are the maximum positive and minimum negative similarity values from the training examples. The document vector is normalised, then multiplied by the certainty before addition to the profile vector. The document vector can be further scaled down using a scale factor. In the runs submitted to the TDT evaluation, we scaled these down by a factor of five; this was an arbitrary choice, and we plan to experiment with this approach further in the future.
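The profile construction, thresholding, and certainty-scaled adaptation described above can be sketched as follows. This is a Python sketch under the stated formulas only; the vector representation and function names are illustrative, truncation to the fixed-size active profile is omitted, and this is not the original C code.

```python
import numpy as np

def build_profile(pos_vecs, neg_vecs):
    """Initial Rocchio profile: add positive and subtract negative
    training vectors (beta = gamma = 1.0), zero any negative
    weights [2, 3], then normalise to unit length."""
    profile = sum(pos_vecs) - sum(neg_vecs)
    profile = np.maximum(profile, 0.0)
    return profile / np.linalg.norm(profile)

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def threshold(on_min, off_max, twb=4.0):
    """Piecewise threshold rule; TWB defaults to 4, and a larger TWB
    moves the threshold closer to off_max, improving recall."""
    if off_max > on_min:
        return on_min
    return off_max + (on_min - off_max) / twb

def certainty(sim, thresh, pos_max, neg_min):
    """Linear certainty of a decision: 0 at the threshold, 1 at the
    most extreme training similarity on the relevant side."""
    if sim > thresh:
        return (sim - thresh) / (pos_max - thresh)
    return (thresh - sim) / (thresh - neg_min)

def adapt(profile, doc_vec, sim, thresh, pos_max, neg_min, scale=0.2):
    """One unsupervised update: YES documents are added, NO documents
    subtracted, scaled by certainty and a fixed damping factor (the
    submitted runs scaled vectors down by five, i.e. scale = 0.2)."""
    cert = certainty(sim, thresh, pos_max, neg_min)
    direction = 1.0 if sim > thresh else -1.0
    doc_unit = doc_vec / np.linalg.norm(doc_vec)
    return profile + direction * scale * cert * doc_unit
```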
2.2. Pseudo Relevance Feedback

Optional pseudo-relevance feedback (PRF) is also included in the system, as a method to include more related terms and raise recall. Three gigabytes of TREC data were chosen as a source of PRF. PRF is introduced into the system after training is complete and the topic profile has been normalised, but before a threshold has been chosen. The MG [8] full-text information retrieval engine (mg-2.1) is used to generate PRF. A ranked query is constructed from the three highest-weighted terms in the topic profile, and the top ten documents returned by the ranked query are processed into document vectors. These document vectors are then added to the topic profile in the same manner as adaptive feedback documents, except that certainty is set to the same value as similarity.
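A sketch of this feedback loop follows. The `search` function is a hypothetical stand-in for the MG ranked-query interface, and `vectorise` for the Okapi document-vector construction above; neither reflects the real MG API.

```python
import numpy as np

def prf_update(profile, vocab, search, vectorise, n_terms=3, n_docs=10):
    """Fold pseudo-relevant documents into a normalised topic profile.

    vocab:     term index -> term string, aligned with the vectors
    search:    ranked retrieval over the PRF corpus (stand-in for MG)
    vectorise: text -> Okapi-weighted vector over the same vocabulary
    """
    # Ranked query from the three highest-weighted profile terms.
    top = np.argsort(profile)[::-1][:n_terms]
    query = [vocab[i] for i in top]
    for text in search(query, limit=n_docs):  # top ten documents
        vec = vectorise(text)
        sim = (profile @ vec) / (np.linalg.norm(profile) * np.linalg.norm(vec))
        # Added like an adaptive-feedback document, but with the
        # certainty set equal to the similarity itself.
        profile = profile + sim * (vec / np.linalg.norm(vec))
    return profile
```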
3. Experiments

Prior to the release of the TDT 2001 evaluation corpus and indexes, the TDT3 data and TDT 2001 dry run indexes were used to compare relative system performance under various conditions. All runs performed used English training documents, reference document boundaries, English and translated Mandarin test documents, and manual transcripts of audio sources. The following control files from the dry run indexes were used:

trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=1.ctl
trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=2.ctl
trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=4.ctl

Several experiments were performed using Nt ∈ {1, 2, 4} and Nn ∈ {0, 2}. It was found that Nt = 4, Nn = 0 gives the best results, but possibly only because the lack of negative training examples causes the final threshold to be lower. In order to discriminate against irrelevant terms appearing in the positive examples, a parameter value of Nn = 2 was used for four of the five runs we submitted for the final evaluation.

3.1. Experiment Conditions

Runs were submitted for evaluation using the parameters shown in Table 1.

Run     Nt   Nn   TWB   Adap. Prof. Size   PRF Docs
a5042    4    2     4                 50          –
p42      4    2     4                  –         10
st42     4    2     4                  –          –
th842    4    2     8                  –          –
p10      1    0    10                  –         10

Table 1: Run parameters for RMIT runs submitted.

4. Results

A comparison of the results of the runs shown in Table 1 is given in Table 2 (these are taken from the official NIST TDT evaluation). Statistics displayed are topic-weighted and macroaveraged.

Run     CDET     Pmiss    Pfa
a5042   0.5171   0.5141   0.0006
p42     0.3809   0.3697   0.0023
st42    0.5267   0.5252   0.0003
th842   0.3839   0.3766   0.0015
p10     0.4764   0.4575   0.0038

Table 2: RMIT TDT 2001 Topic Tracking Results

Table 2 shows that our false alarm rates are very low, while the miss rates are high, resulting in poor CDET scores. We believe that this is due to two factors:

1. Term expansion is not performed. This may cause our system to miss many documents, because we do not have synonyms and related words that occur in on-topic test set documents.

2. Our thresholding algorithm is simple, and appears to set thresholds somewhat higher than the optimum value.

In Table 3, it can be seen that by lowering our threshold, we could have achieved fewer misses and more false alarms, with an overall improvement in our performance.

Run     min(CDET)   min(Pmiss)   min(Pfa)
a5042   0.2785      0.2160       0.0128
p42     0.2244      0.1767       0.0097
st42    0.1939      0.1435       0.0103
th842   0.1939      0.1435       0.0103
p10     0.3636      0.2850       0.0160

Table 3: RMIT TDT 2001 Minimum Topic Tracking Results

Our primary evaluation condition run is shown in Figure 1.

Figure 1: Top 10 pseudo-relevant documents added to profile (run "p10").

Our best evaluation run was p42, and this is plotted in Figure 2.

Figure 2: Top 10 pseudo-relevant documents added to profile (run "p42").

5. Discussion of Results

Our runs had high precision, as evidenced by the low false alarm rate. However, recall is the area of deficiency in our runs, and this results in only moderate performance overall. One immediate improvement to recall, which can be observed by comparing runs st42 and th842, is to lower the decision threshold.

Our best run used pseudo-relevance feedback. However, a closer examination of the minimum achievable CDET shows that this is only because of the threshold point chosen. It appears that the th842 and st42 runs still have the best minimum CDET. Normalising topic score distributions may prove an effective way to solve this problem, while a better source of pseudo-relevance feedback may be an older TDT corpus rather than TREC.

Adaptive profiling appears to have been partly successful, but it is apparent that significant improvement is still possible. In particular, we have found that it is difficult to balance positive and negative feedback, and to keep the system stable. One obvious improvement would be to add the ability to recalculate the threshold as the profile changes.

Static profiling has the advantage of being simpler to implement and faster to run, as we skip the steps of sorting and renormalising the active profile. However, static profiling is limited in that it cannot handle "concept drift" in a topic. We believe that it may be necessary to use adaptive profiling in order to improve our results further.

The adaptive profiling system used here implements a linear scaling function for profile adaptation, based on the certainty of a document being on or off topic. A better method may be to use a sigmoid function, so that borderline documents are ignored entirely.

6. Conclusion and Future Directions

For this year's TDT evaluation, we developed a baseline topic tracking system. Some variations on the baseline were attempted, including adaptive tracking and pseudo-relevance feedback. While our system was efficient, and based on our previous work in fast categorisation, the accuracy of the system can be improved substantially.

There are many possible future directions. The most promising directions for increasing accuracy are to improve the thresholding algorithm, normalise topic relevance scores, disambiguate topic profiles, apply higher weightings to proper names, and recognise some multiword features. We also plan additional efficiency work.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2001